[DSv4] Tune default value of VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD #41526

Merged
ywang96 merged 2 commits into main from tune-threshold on May 3, 2026

Conversation

ywang96 (Member) commented on May 3, 2026

Empirically, 1024 is a better default value for turning on multi-stream. In a disaggregated setting, this env var already disables multi-stream on prefill nodes and enables it on decode nodes (since it is rare for one rank to handle > 1024 requests), which is the behavior we intend.

For the aggregated setup, data on Blackwell indicates that 1024 is a better default than 4096.

GB300 DEP4, 8k/1k

  ┌───────────┬────────────────────┐
  │ Threshold │ Throughput (tok/s) │
  ├───────────┼────────────────────┤
  │ 1024      │ 25,881.9           │
  ├───────────┼────────────────────┤
  │ 256       │ 25,529.8           │
  ├───────────┼────────────────────┤
  │ 512       │ 24,351.5           │
  ├───────────┼────────────────────┤
  │ 2048      │ 23,443.1           │
  └───────────┴────────────────────┘

GB200 DEP8, 8k/1k

  ┌───────────┬────────────────────┐
  │ Threshold │ Throughput (tok/s) │
  ├───────────┼────────────────────┤
  │ 4096      │ 33,435.4           │
  ├───────────┼────────────────────┤
  │ 1024      │ 33,044.5           │
  ├───────────┼────────────────────┤
  │ 2048      │ 33,042.6           │
  ├───────────┼────────────────────┤
  │ 512       │ 32,085.7           │
  ├───────────┼────────────────────┤
  │ 128       │ 32,015.5           │
  ├───────────┼────────────────────┤
  │ 256       │ 31,323.8           │
  └───────────┴────────────────────┘
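The threshold semantics described above can be sketched as follows. This is a minimal illustration only, assuming multi-stream GEMM is enabled when the batch's token count falls below the threshold; the function name and the exact comparison are hypothetical, not vLLM's actual internals:

```python
import os

# Hypothetical sketch: vLLM's real gating logic lives elsewhere. This only
# illustrates the threshold semantics discussed in the PR description.
def use_multi_stream_gemm(num_tokens: int) -> bool:
    # New default is 1024 (was 4096); overridable via the environment.
    threshold = int(os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "1024"))
    # Decode batches (few tokens per rank) fall under the threshold and get
    # multi-stream; large prefill batches exceed it and do not.
    return num_tokens < threshold

print(use_multi_stream_gemm(64))    # typical decode batch
print(use_multi_stream_gemm(8192))  # typical prefill batch
```

Under this reading, a disaggregated deployment gets the split described above for free: prefill ranks see token counts far above 1024 and skip multi-stream, while decode ranks stay below it.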


Co-authored-by: Copilot <copilot@github.com>

claude (Bot) left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request updates the default value for VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD from 4096 to 1024 in vllm/envs.py. A review comment points out that the existing documentation comments on lines 1689-1690 are now outdated and should be updated to align with the new threshold and its rationale.

Comment thread: vllm/envs.py

      # is ~4096; B200 (132 SMs) is expected ~3072.
      "VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD": lambda: int(
  -       os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "4096")
  +       os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "1024")
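The diff above follows the pattern vllm/envs.py uses for environment variables: each entry maps a name to a lambda that reads os.getenv with a string default and parses it lazily, so the value can be overridden before lookup. A self-contained sketch of that pattern (the dictionary here is a stand-in, not the real envs.py module):

```python
import os

# Sketch of the envs.py pattern from the diff: the default is a string
# passed to os.getenv and parsed lazily at lookup time.
environment_variables = {
    "VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD": lambda: int(
        os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "1024")
    ),
}

# Unset: falls back to the new default of 1024.
print(environment_variables["VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD"]())

# Explicit override, e.g. to restore the old 4096 threshold:
os.environ["VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD"] = "4096"
print(environment_variables["VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD"]())
```

Because the lambda is evaluated on each call, exporting the variable in the serving environment is enough to change the threshold; no code change is needed.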
gemini-code-assist (Bot, Contributor) commented (severity: high):

The documentation comment on lines 1689-1690 is now outdated and contradicts the new default value. It states that the empirical crossover is ~4096 for B300 and ~3072 for B200, which no longer aligns with the decision to set the default to 1024 based on the new empirical data provided in this PR. Please update the comment to reflect the current findings and the rationale for the 1024 threshold to avoid confusing users and developers.

ywang96 merged commit 856ec48 into main on May 3, 2026
7 of 9 checks passed
ywang96 deleted the tune-threshold branch on May 3, 2026 at 01:32
ywang96 added a commit that referenced this pull request May 3, 2026
…#41526)

Co-authored-by: Copilot <copilot@github.com>
joa-stdn pushed a commit to joa-stdn/vllm that referenced this pull request May 4, 2026
…vllm-project#41526)

Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026
…vllm-project#41526)

Co-authored-by: Copilot <copilot@github.com>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
…vllm-project#41526)

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
